Abstract: We consider the rate of convergence of the expected loss of empirically optimal vector quantizers. Earlier results show that the mean-squared expected distortion for any fixed distribution supported on a bounded set and satisfying some regularity conditions decreases at the rate $O(\log n/n)$. We prove that this rate is actually $O(1/n)$. Although these conditions are hard to check, we show that well-polarized distributions with continuous densities supported on a bounded set are included in the scope of this result.

1. Introduction

Clustering is the problem of identifying groupings of similar points that are relatively far from one another or, in other words, of partitioning the data into dissimilar groups of similar items. For a comprehensive introduction to this topic, the reader is referred to the monograph of Graf and Luschgy [8]. Isolating meaningful groups from a cloud of data is a topic of interest in many fields, from social science to biology. In fact, this issue originates in the theory of signal processing of the late 1940s, where it is known as the quantization issue, or lossy data compression (see Gersho and Gray [7] for a comprehensive treatment of this topic). More precisely, let $X_1, \ldots, X_n$ denote $n$ independent and identically distributed random variables drawn from a distribution $P$ over $\mathbb{R}^d$, equipped with its Euclidean norm $\|\cdot\|$, and let $Q$ denote a $k$-quantizer, that is, a map from $\mathbb{R}^d$ to $\mathbb{R}^d$ such that $\mathrm{Card}(Q(\mathbb{R}^d)) \le k$. Let $c \in (\mathbb{R}^d)^k$ be a concatenation of $k$ $d$-dimensional vectors $c_1, \ldots, c_k$. Without loss of generality we only consider quantizers of the type $x \mapsto c_i$, where $\|x - c_i\| = \min_{j=1,\ldots,k} \|x - c_j\|$. The $c_j$'s are called clusters. To measure how well the quantizer $Q$ performs in representing the source distribution, a possible way is to look at

\[ R(c) = \mathbb{E}\|X - Q(X)\|^2 = \mathbb{E} \min_{j=1,\ldots,k} \|X - c_j\|^2, \]

when $\mathbb{E}\|X\|^2 < \infty$. The goal here is to find a set of clusters $\hat{c}_n$, drawn from the data $X_1, \ldots, X_n$, whose distortion is as close as possible to the optimal distortion $R^* = \inf_{c \in (\mathbb{R}^d)^k} R(c)$. To solve the problem, most approaches to date attempt to implement the principle of empirical error minimization in the vector quantization context. According to this principle, good clusters can be found by searching for ones that minimize the empirical distortion over the training data, defined by

\[ R_n(c) = \frac{1}{n}\sum_{i=1}^{n} \|X_i - Q(X_i)\|^2 = \frac{1}{n}\sum_{i=1}^{n} \min_{j=1,\ldots,k} \|X_i - c_j\|^2. \]

The existence of such empirically optimal clusters has been established by Graf and Luschgy [8, Theorem 4.12]. Let us denote by $\hat{c}_n$ one of these vectors of empirically optimal clusters. If the training data represents the source well, $\hat{c}_n$ will hopefully also perform near-optimally on the real source. Roughly, this means that we expect $R(\hat{c}_n) \approx R^*$. The problem of quantifying how good empirically designed clusters are, compared to the truly optimal ones, has been extensively studied; see for instance Linder [10].

To reach the latter goal, a standard route is to exploit the Wasserstein distance between the empirical distribution and the source distribution, to derive upper bounds on the average distortion of empirically optimal clusters. Following this approach, Pollard [15] proved that if $\mathbb{E}\|X\|^2 < \infty$, then $R(\hat{c}_n) - R^* \to 0$ almost surely, as $n \to \infty$. More recently, Linder, Lugosi and Zeger [11], and Biau, Devroye and Lugosi [3] showed that if the support of $P$ is bounded, then $\mathbb{E}(R(\hat{c}_n) - R^*) = O(1/\sqrt{n})$, using techniques borrowed from statistical learning theory. Bartlett, Linder and Lugosi [2] established that this rate is minimax over distributions supported on a finite set of points.

However, faster rates can be achieved, using methods inspired by statistical learning theory. For example, it is shown by Chou [6], following a result of Pollard [16], that $R(\hat{c}_n) - R^* = O_P(1/n)$, under some regularity conditions on the source distribution. Nevertheless, this consistency result does not provide any information on how many training samples are needed to ensure that the average distortion of empirically optimal clusters is close to the optimum. Antos, Gyorfi and Gyorgy established in [1] that $\mathbb{E}(R(\hat{c}_n) - R^*) = O(\log n/n)$ under the same conditions, paying a $\log n$ factor to derive a non-asymptotic bound. It is worth pointing out that the conditions cannot be checked in practice, and consequently remain of a theoretical nature. Moreover, the rate of $1/n$ for the average distortion can be achieved when the source distribution is supported on a finite set of points. Consequently, an open question is whether this optimal rate can be attained for more general distributions.

In the present paper, we improve previous results of Antos, Gyorfi and Gyorgy [1] by getting rid of the $\log n$ factor. Besides, we express Pollard's condition in a more reader-friendly framework, involving the density of the source distribution. To this aim we use statistical learning arguments and prove that the average distortion of empirically optimal clusters decreases at the rate $O(1/n)$. To get this result we use techniques such as the localization principle borrowed from Blanchard, Bousquet and Massart [4] or Koltchinskii [9]. The condition we offer can be easily interpreted as a margin-type condition, similar to the ones of Massart and Nedelec in [13], showing a clear connection between statistical learning theory and vector quantization.

The paper is organized as follows. In Section 2 we introduce notation and definitions of interest. In Section 
3 we offer our main results. These results are discussed in Section 4, and illustrated on examples such as 
Gaussian mixtures or quasi-finite distribution. Finally, proofs are gathered in Section 5. 

2. The quantization problem 

Throughout the paper, $X_1, \ldots, X_n$ is a sequence of independent $\mathbb{R}^d$-valued random observations with the same distribution $P$ as a generic random variable $X$. To frame the quantization problem as a statistical learning one, we first have to consider quantization as a contrast minimization issue. To this aim we introduce the following notation. Let $c = (c_1, \ldots, c_k)$ be a vector of possible clusters. The contrast function $\gamma$ is defined as

\[ \gamma : (\mathbb{R}^d)^k \times \mathbb{R}^d \to \mathbb{R}, \qquad (c, x) \mapsto \min_{j=1,\ldots,k} \|x - c_j\|^2. \]

Within this framework, the risk takes the form $R(Q) = R(c) = P\gamma(c, \cdot)$, where $Pf$ means integration of the function $f$ with respect to $P$. In the same way, if $P_n$ denotes the empirical distribution that is induced on $\mathbb{R}^d$ by the $n$-sample $X_1, \ldots, X_n$, we can express the empirical risk $R_n(Q)$ as $P_n\gamma(c, \cdot)$.
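As an illustration of this notation, the following minimal sketch computes the contrast $\gamma(c, x)$ and the empirical risk $P_n\gamma(c, \cdot)$ with NumPy. The function names `contrast` and `empirical_risk`, as well as the tiny sample, are hypothetical and chosen only for this example.

```python
import numpy as np

def contrast(c, x):
    """gamma(c, x) = min_j ||x - c_j||^2, evaluated for each row of x.

    c: array of shape (k, d), one cluster per row.
    x: array of shape (n, d), one observation per row.
    """
    # Pairwise squared Euclidean distances, shape (n, k).
    sq_dists = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1)

def empirical_risk(c, sample):
    """P_n gamma(c, .): average contrast over the n-sample."""
    return contrast(c, sample).mean()

# Hypothetical toy data: three points in the unit ball of R^2, two clusters.
sample = np.array([[0.0, 0.0], [0.2, 0.0], [0.6, 0.6]])
clusters = np.array([[0.1, 0.0], [0.6, 0.6]])
risk = empirical_risk(clusters, sample)
```

Here the first two points are each at squared distance 0.01 from the first cluster and the third point coincides with the second cluster, so `risk` equals 0.02/3.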

Note that, within this context, an optimal $c^*$ minimizes $P\gamma(c, \cdot)$, whereas $\hat{c}_n \in \arg\min_{c \in (\mathbb{R}^d)^k} P_n\gamma(c, \cdot)$. It is worth pointing out that the existence of both $\hat{c}_n$ and $c^*$ is guaranteed by Graf and Luschgy [8, Theorem 4.12]. In the sequel we denote by $\mathcal{M}$ the set of such minimizers of the true risk $P\gamma(c, \cdot)$, so that $c^* \in \mathcal{M}$. To measure how well a vector of clusters $c$ performs compared to an optimal one, we will make use of the loss

\[ \ell(c, c^*) = R(c) - R(c^*) = P\left(\gamma(c, \cdot) - \gamma(c^*, \cdot)\right). \]

Throughout the paper we will use the following assumptions on the source distribution. Let $B(0, M)$ denote the closed ball of radius $M$, with $M > 0$.

Assumption 1 (Peak Power Constraint). The distribution $P$ is such that $P(B(0, 1)) = 1$.

Note that Assumption 1 is stronger than the requirement $\mathbb{E}\|X\|^2 < \infty$, as it imposes an $L_\infty$-boundedness condition on the random variable $X$. For convenience we assume that the distribution is bounded by 1. However, it is important to note that our results hold for random variables $X$ bounded from above by an arbitrary $M$. We will also need the following regularity requirement, first introduced by Pollard [16].

Assumption 2 (Pollard's regularity condition). The distribution $P$ satisfies the following two conditions:

1. $P$ has a continuous density $f$ with respect to Lebesgue measure on $\mathbb{R}^d$,

2. The Hessian matrix of $c \mapsto P\gamma(c, \cdot)$ is positive definite at every optimal vector of clusters $c^*$.

One can point out that Condition 1 of Assumption 2 alone does not guarantee the existence of a second derivative for the expectation of the contrast function. Nevertheless, Assumption 1 and Condition 1 of Assumption 2 are enough to guarantee that the map $c \mapsto P\gamma(c, \cdot)$ is twice differentiable. Let $V_i$ be the Voronoi cell associated with $c_i$, for $i = 1, \ldots, k$. In this situation, the Hessian matrix is composed of the following $d \times d$ blocks:

\[ H(c)_{i,j} = \begin{cases} 2P(V_i) I_d - 2\sum_{\ell \neq i} r_{i\ell}^{-1}\, \sigma\left[f(x)(x - c_i)(x - c_i)^t \mathbf{1}_{\partial(V_i \cap V_\ell)}\right] & \text{for } i = j, \\ -2 r_{ij}^{-1}\, \sigma\left[f(x)(x - c_i)(x - c_j)^t \mathbf{1}_{\partial(V_i \cap V_j)}\right] & \text{for } i \neq j, \end{cases} \]

where $r_{ij} = \|c_i - c_j\|$, $\partial(V_i \cap V_j)$ denotes the possibly empty common face of $V_i$ and $V_j$, and $\sigma$ means integration with respect to the $(d-1)$-dimensional Lebesgue measure. For a proof of that statement, we refer to Pollard [16].

When Assumption 1 and Assumption 2 are satisfied, Chou [6] proved that $\ell(\hat{c}_n, c^*) = O_P(1/n)$, whereas Antos, Gyorfi, and Gyorgy established that $\mathbb{E}\ell(\hat{c}_n, c^*) \le C\frac{\log n}{n}$, where $C$ is a constant depending on the distribution $P$.

The proofs of these two results are both based on arguments which have a connection with the localization principle ([12], [9]), which provides faster rates of convergence when the expectation and the variance of $\gamma(c, \cdot) - \gamma(c^*, \cdot)$ are connected. To prove his result, Pollard used conditions under which the distortion and the Euclidean distance are connected, and used chaining arguments to bound from above a term which looks like a Rademacher complexity, constrained to an area around an optimal vector of clusters. Note that Koltchinskii [9] used a similar method to apply the localization principle. On the other hand, Antos, Gyorfi and Gyorgy exploited Pollard's condition, and used a concentration inequality based on the fact that the variance and the expectation of the distortion are connected to get their result. Interestingly, this point of view has been developed by Blanchard, Bousquet and Massart [4] to get bounds on the classification risk of the SVM, using the localization principle. That is the approach that will be followed in the present document.

3. Main results 

We are now in a position to state our main result. 

Theorem 3.1. Assume that Assumption 1 and Assumption 2 are satisfied. Then, denoting by $\hat{c}_n$ an empirical risk minimizer, we have

\[ \mathbb{E}\ell(\hat{c}_n, c^*) \le \frac{C_0}{n}, \]

where $C_0$ is a positive constant depending on $P$, $k$ and $d$.
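The $1/n$ rate can be checked by hand in the simplest situation $k = 1$, which is not treated separately in the paper and is used here only as an illustration. In that case the empirical risk minimizer is the sample mean, $c^* = \mathbb{E}X$, and $\ell(c, c^*) = \|c - c^*\|^2$, so that $\mathbb{E}\ell(\hat{c}_n, c^*) = \mathrm{Var}(X)/n$ exactly. The sketch below, under the assumption of a triangular density on $[-1, 1]$ (continuous, bounded support, variance $1/6$), estimates $n\,\mathbb{E}\ell(\hat{c}_n, c^*)$ by Monte Carlo; the function name and the number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_excess_loss(n, reps=10000):
    """Monte Carlo estimate of E l(c_hat_n, c*) for k = 1 and d = 1."""
    # Triangular density on [-1, 1] with mode 0: mean 0, variance 1/6.
    x = rng.triangular(-1.0, 0.0, 1.0, size=(reps, n))
    c_hat = x.mean(axis=1)  # empirical risk minimizer when k = 1
    # For k = 1, c* = E[X] = 0 and l(c, c*) = (c - c*)^2.
    return np.mean(c_hat ** 2)

# n * E l(c_hat_n, c*) should stay close to Var(X) = 1/6 for every n.
scaled = {n: n * mean_excess_loss(n) for n in (25, 100, 400)}
```

In this toy case the constant $C_0$ of Theorem 3.1 can be taken equal to the variance of $X$, and the rescaled values in `scaled` should not drift with $n$.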

This result improves previous non-asymptotic results of Antos, Gyorfi and Gyorgy [1] and of Linder, Lugosi and Zeger [11], showing that a convergence rate of $1/n$ can be achieved in expectation. To prove Theorem 3.1, the key result is based on a version of Talagrand's inequality due to Bousquet [5] and its application to localization, following the approach of Massart and Nedelec [13]. The main point is to connect $\mathrm{Var}(\gamma(c, \cdot) - \gamma(c^*, \cdot))$ to $P(\gamma(c, \cdot) - \gamma(c^*, \cdot))$ for all possible $c$. To be more precise, Pollard's condition involves differentiability of the distortion, therefore $\gamma(c, \cdot) - \gamma(c^*, \cdot)$ is naturally linked to $\|c - c^*\|$, the Euclidean distance between $c$ and $c^*$. However, it is noteworthy that, mimicking the proof of Antos, Gyorfi and Gyorgy [1, Corollary 1], we have in fact:



Proposition 3.1. Suppose that Assumption 1 and Assumption 2 are satisfied. Then there exist two positive constants $A_1$ and $A_2$, depending on the distribution $P$, such that

1. (H1): $\forall c \in B(0, 1)^k$, $\quad \|c - c^*(c)\|^2 \le A_1 \ell(c, c^*(c))$,

2. (H2): $\forall c \in B(0, 1)^k$, $\forall c^* \in \mathcal{M}$, $\quad \mathrm{Var}(\gamma(c, \cdot) - \gamma(c^*, \cdot)) \le A_2 \|c - c^*\|^2$,

where $c^*(c) \in \arg\min_{c^* \in \mathcal{M}} \|c - c^*\|$.

When considering several possible optimal vectors of clusters, we have to choose one to be compared with our empirical vector $\hat{c}_n$. A nearest optimal vector of clusters $c^*(\hat{c}_n)$ is a natural choice. It is important to note that, for every $c \in (\mathbb{R}^d)^k$ and $c^* \in \mathcal{M}$, $\ell(c, c^*(c)) = \ell(c, c^*)$. Consequently, Theorem 3.1 holds for every possible $c^* \in \mathcal{M}$. Besides, it is easy to see, using the compactness of $B(0, 1)$, that there is only a finite set of optimal clusters $c^*$ when Assumption 1 is satisfied and the Hessian matrices $H(c^*)$ are positive definite for every possible $c^*$. This compactness argument is also the key to turn the local positiveness of $H(c^*)$ into the global property (H1), and the regularity of the contrast function $\gamma$ into the global property (H2). These two properties exactly match the two parts of the proof of Antos, Gyorfi and Gyorgy [1, Corollary 1], which in turn implies Proposition 3.1. Note also that, from Proposition 3.1, we get $\mathrm{Var}(\gamma(c, \cdot) - \gamma(c^*(c), \cdot)) \le A_1 A_2 \ell(c, c^*(c))$. This allows us to use localization techniques such as in the paper of Blanchard, Bousquet and Massart [4].

Pollard's regularity condition (Assumption 2) involves second derivatives of the distortion. Consequently, checking Assumption 2, even theoretically, remains a hard issue. We give a more general condition regarding the $L_\infty$-norm of the density $f$ on the boundaries of the Voronoi diagram, for the distribution to satisfy Assumption 2. We recall that $\mathcal{M}$ denotes the set of all possible optimal clusters $c^*$.

Theorem 3.2. Denote by $V_i^*$ the Voronoi cell associated with $c_i^*$ in the Voronoi diagram associated with $c^*$, by $N^*$ the union of all possible boundaries of Voronoi cells with respect to all possible optimal vectors of clusters $c^*$, and by $\Gamma$ the Gamma function. Let $B = \inf_{c^* \in \mathcal{M},\, i \neq j} \|c_i^* - c_j^*\|$. Suppose that

ll/i-IU<^ c , e ^,.../(^)- 

Then $P$ satisfies Assumption 2.

The proof is given in Section 5. It is important to note that, for general distributions supported on $B(0, M)$, we can state a similar theorem, involving $M^{d+1}$ in the right-hand side of the inequality in Theorem 3.2. However, a source distribution supported on $B(0, M)$ can be turned into a distribution supported on $B(0, 1)$, using a homothetic transformation. Therefore we will only state results for a distribution supported on $B(0, 1)$.

This theorem emphasizes the idea that if $P$ is well concentrated around its optimal clusters, then some localization conditions can hold and therefore it is a favorable case. The intuition behind this result is given by the extremal case where the boundaries of the Voronoi cells are empty with respect to $P$. This case is described in detail in Section 4. Moreover, the notion of a well-concentrated distribution resembles the margin-type conditions for the classification case, as described by Massart and Nedelec [13]. This confirms the intuition of an easy-to-quantize distribution, when the poles are well-separated.

4. Discussion and examples 
4.1. Minimax lower bound

Let $\mathcal{P}$ denote the set of probability distributions on $B(0, 1)$. Bartlett, Linder and Lugosi [2] offered a minimax lower bound for general distributions:

\[ \sup_{P \in \mathcal{P}} \mathbb{E}\ell(\hat{c}_n, c^*) \ge c_0 \sqrt{\frac{k^{1-4/d}}{n}}. \]

Consequently, for general distributions, this minimax bound matches the upper bound on $\mathbb{E}\ell(\hat{c}_n, c^*)$ that Linder, Lugosi, and Zeger [11] obtained. A question of interest is whether the rate of $1/n$ we get in Theorem 3.1 is minimax over the set of distributions which satisfy Assumption 1 and Assumption 2. Proposition 4.1 below answers this question.
Proposition 4.1. Let $n$ be an integer, and denote by $\mathcal{D}$ the set of distributions $P$ satisfying Assumption 1 and Assumption 2. Then there exist a constant $c_0$ and an integer $k_n$ such that, for any $k_n$-point quantizer $\hat{c}_n$,

\[ \sup_{P \in \mathcal{D}} \mathbb{E}\ell(\hat{c}_n, c^*) \ge c_0 \sqrt{\frac{k_n^{1-4/d}}{n}}. \]

There is no contradiction between Theorem 3.1 and Proposition 4.1. In fact, in Theorem 3.1, $k$ is fixed, whereas, in Proposition 4.1, $k_n$ strongly depends on $n$. Therefore, it is an interesting question whether we can get such a minimax bound when $k$ is fixed.

The proof of Proposition 4.1 follows the proof of Bartlett, Linder and Lugosi [2, Theorem 1], and it is therefore omitted in this paper. The main idea is to replace the distribution supported on $2n$ points proposed by these authors in Step 3 of the proof, with a distribution supported on $2n$ small balls satisfying Assumption 1 and Assumption 2.



4.2. Assumption 1 is necessary

The original result of Pollard [16] assumes only that $\mathbb{E}\|X\|^2 < \infty$, to get an asymptotic rate of $O_P(1/n)$. Consequently, it is an interesting question whether Assumption 1 can be replaced with the assumption $\mathbb{E}\|X\|^2 < \infty$ in Theorem 3.1. In fact, Assumption 1 is useful to get a global localization result from a local one, through a compactness argument. This is precisely the result of Proposition 3.1, which provides us with the global argument required for applying some localization result from a local regularity condition. However, following the idea of Antos, Gyorfi and Gyorgy [1], it is possible to suppose only that $\mathbb{E}\|X\|^2 < \infty$ and nevertheless get (H2) in Proposition 3.1, as expressed in the following result.

Proposition 4.2. Suppose that $\mathbb{E}\|X\|^2 < \infty$ and that the set of all possible optimal clusters $c^*$ is finite. Then there exists a constant $A_2$, depending on $P$, such that

\[ \forall c, \ \forall c^* \in \mathcal{M}, \quad \mathrm{Var}(\gamma(c, \cdot) - \gamma(c^*, \cdot)) \le A_2 \|c - c^*\|^2. \]

A proof of Proposition 4.2 can be directly deduced from the proof of [1, Theorem 2]. Consequently it is omitted in this paper. According to Proposition 4.2, we can expect to control the variance of our process indexed by $c$ and $c^*$ with the Euclidean distance $\|c - c^*\|$, even if the support of $P$ is not contained within a ball. Unfortunately, when the distribution is not supported on a bounded set, there are cases where the term $\ell(c, c^*(c))$ cannot dominate $\|c - c^*(c)\|^2$ for all $c$, as expressed in the following counter-example. Let

rj > 0, q(r)) = — — ^ , and define the density / of the distribution P supported on M. by 



i if xe[Q,r]} 

f , ^ if x G [R, R + n] 

S q(v)e- x if x>2R-l + a ■ 

elsewhere 

Proposition 4.3. Set $\eta = 2$, $R = 10$, and define $c_n = (0, n, n^2)$. Then
(i) $P$ satisfies Assumption 2.

(ii) We have $\ell(c_n, c^*(c_n)) \to P\|x\|^2 < \infty$ as $n \to \infty$.
(iii) We have $\|c_n - c^*(c_n)\|^2 \sim n^4$ as $n \to \infty$.

One easily deduces from Proposition 4.3 that the distribution $P$ satisfies Assumption 2, but fails to satisfy (H1) in Proposition 3.1. Therefore Assumption 1 is necessary to get the result of Theorem 3.2. The intuition behind this counter-example is that two phenomena prevent $\ell(c, c^*(c))$ from being at least proportional to $\|c - c^*(c)\|^2$ when $c$ is arbitrarily far from 0. Firstly, the underlying measure "erases" the Euclidean distance in the expression of $\ell(c, c^*(c))$, which implies that $\ell(c_n, c^*(c_n))$ converges. Therefore, a suitable criterion to link $\ell(c, c^*(c))$ and $\mathrm{Var}(\gamma(c, \cdot) - \gamma(c^*(c), \cdot))$ should probably involve a weight drawn from the tail of $P$. Typically we expect such a criterion to be a function of $\|c - c^*(c)\|^2$, taking into account a tail constraint on the distribution $P$ we consider. Secondly, this example shows that, if for instance we take the 3-quantizer $c_n = (n, n^2, n^4)$, the relative loss $\ell(c_n, c^*(c_n))$ will mostly depend on the contribution of the smallest cluster $n$ when $n$ grows to infinity, whereas $\|c_n - c^*(c_n)\|^2$ essentially depends on the distance to the farthest cluster $n^4$.

To conclude, the Euclidean distance does not take into account the weight induced by the underlying distribution over the space. Thus, when Assumption 1 is relaxed, the dominant clusters for the Euclidean distance from $c^*$ are essentially the farthest ones. On the other hand, when integrating with respect to $P$, far-from-zero clusters lose their influence in the loss $\ell(c_n, c^*(c_n))$.



4.3. A toy example

In this subsection we intend to understand which conditions on the density $f$ can guarantee that the Hessian matrices $H$ are positive definite. To this aim we consider an extremal case, in which the probability distribution is supported on small balls scattered in $B(0, 1)$. Roughly, if the balls are small enough and far from one another, the optimal quantization points should be the centers of these balls. These are the ideas behind the following proposition.

Proposition 4.4. Let $z_1, \ldots, z_k$ be vectors in $\mathbb{R}^d$. Let $\rho$ be a positive number and $R = \inf_{i \neq j} \|z_i - z_j\|$ be the smallest possible distance between these vectors. Let the distribution $P$ be defined as follows

[ P \B(zi,p) ~U\B( Zi ,p) 

where U\B{ Zi .p) denotes the uniform distribution over B{zi,p). Then, if (-j — 3p) 2 > ^r§, the optimal k- 
centroid vector is (zx, • ■ • , 

The proof of Proposition 4.4, which is given in Section 5, is inspired by a proof of Bartlett, Linder and Lugosi [2, Step 3]. It is interesting to note that Proposition 4.4 can be extended to the situation where we assume that the underlying distribution is supported on $k$ small enough subsets. In this context, if each subset has a not too small $P$-measure, and if those subsets are far enough from one another, it can be proved in the same way that an optimal quantizer has a point in every small subset.
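The situation of Proposition 4.4 can be explored numerically. The sketch below draws points from a mixture of uniform distributions on small, well-separated balls and runs plain Lloyd iterations started near the centers. Several choices are assumptions made only for this illustration: the equal mixture weights, the particular centers and radius, and the use of Lloyd's algorithm, which is a heuristic and not an exact empirical risk minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_ball_mixture(centers, rho, n):
    """Draw n points from the equal-weight mixture of uniform laws on B(z_i, rho)."""
    k, d = centers.shape
    labels = rng.integers(k, size=n)
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radius rho * U^(1/d) makes the point uniform in the ball.
    radii = rho * rng.uniform(size=n) ** (1.0 / d)
    return centers[labels] + radii[:, None] * directions

def lloyd(sample, init, n_iter=20):
    """Plain Lloyd iterations: assign to nearest cluster, then recenter."""
    c = init.copy()
    for _ in range(n_iter):
        sq_dists = ((sample[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        assign = sq_dists.argmin(axis=1)
        for j in range(len(c)):
            if np.any(assign == j):
                c[j] = sample[assign == j].mean(axis=0)
    return c

# Hypothetical configuration: three balls of radius 0.05 inside B(0, 1).
centers = np.array([[-0.5, 0.0], [0.5, 0.0], [0.0, 0.6]])
rho = 0.05
sample = sample_ball_mixture(centers, rho, n=3000)
c_hat = lloyd(sample, init=centers + 0.02)
max_dev = np.linalg.norm(c_hat - centers, axis=1).max()
```

Because the balls are far apart compared to their radius, each Voronoi cell of the fitted clusters contains exactly one ball, and each fitted cluster is the average of points of that ball, so `max_dev` cannot exceed `rho`.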

Let us now consider the distribution described in Proposition 4.4, with relevant values for $\rho$ and $R$. We immediately see that if $R/2 > \rho$, then every boundary of the Voronoi diagram for the optimal vector of clusters lies in a null-measure area. Thus, for this distribution,



\[ H(c^*) = \begin{pmatrix} 2P(V_1^*) I_d & & 0 \\ & \ddots & \\ 0 & & 2P(V_k^*) I_d \end{pmatrix}, \]

which is clearly positive definite.

This short example illustrates the idea behind Theorem 3.2. Namely, if the density of the distribution is not too big at the boundaries of the Voronoi diagram associated with every optimal $k$-quantizer, then the Hessian matrix $H$ will roughly behave as a positive diagonal matrix. Thus Pollard's condition (Assumption 2) will be satisfied. This most favorable case is in fact derived from the special case where the distribution is supported on $k$ points. Antos, Gyorfi and Gyorgy [1] proved that if the distribution has only a finite number of atoms, then the expected loss $\mathbb{E}\ell(\hat{c}_n, c^*)$ is at most $C/n$, where $C$ is a constant. Here we spread the atoms into small balls to give a density to the distribution and match the regularity conditions.



4.4. Quasi-Gaussian mixture example

The aim of this subsection is to apply our results to Gaussian mixtures in dimension $d = 2$. However, since the support of a Gaussian random variable is not bounded, we will restrict ourselves to the "quasi-Gaussian mixture" model, which is defined as follows.
Let the density $f$ of the distribution $P$ be defined by

\[ f(x) = \sum_{i=1}^{k} \frac{p_i}{2\pi\sigma^2 N_i}\, e^{-\|x - m_i\|^2/(2\sigma^2)}\, \mathbf{1}_{B(0,1)}(x), \]

where $N_i$ denotes a normalization constant for each Gaussian variable. To ensure that this model is close to the Gaussian mixture model, we assume that there exists a constant $\varepsilon \in [0, 1]$ such that, for $i = 1, \ldots, k$, $N_i \ge 1 - \varepsilon$. Denote by $\tilde{B} = \inf_{i \neq j} \|m_i - m_j\|$ the smallest possible distance between two different means of the mixture. To avoid boundary issues we suppose that, for all $i = 1, \ldots, k$, $B(m_i, \tilde{B}/3) \subset B(0, 1)$. For such a model, we have:

Proposition 4.5. Suppose that 

Pmin . / 288fc ( x 2 96fc \ 
> max = ^— — , = — =- . 

Pmax ~ \ (1 - e)P 2 (l - e -s 2 /288^) (1 _ £ ) a 2 B ( e B/™° 2 -I) J 

Then P satisfies Assumption 2. 

The inequality we propose as a condition in Proposition 4.5 can be decomposed as follows. If

\[ \frac{p_{\min}}{p_{\max}} \ge \frac{288 k \sigma^2}{(1 - \varepsilon)\tilde{B}^2\left(1 - e^{-\tilde{B}^2/(288\sigma^2)}\right)}, \]

then the optimal vector of clusters $c^*$ is close to the vector of means of the mixture $m = (m_1, \ldots, m_k)$. Knowing that, we can locate the boundaries of the Voronoi diagram associated with $c^*$ and apply Theorem 3.2. This leads to the second term of the maximum in Proposition 4.5.

This condition can be interpreted as a condition on the polarization of the mixture. A favorable case for vector quantization seems to be when the poles of the mixture are well-separated, which is equivalent to $\sigma$ being small compared to $\tilde{B}$ when considering Gaussian mixtures. Proposition 4.5 explains how small $\sigma$ has to be compared to $\tilde{B}$ in order to satisfy Assumption 2 and therefore apply Theorem 3.1, to reach an improved convergence rate of $1/n$ for the loss $\ell(\hat{c}_n, c^*)$. Notice that Proposition 4.5 can be considered as an extension of Proposition 4.4. In these two propositions a key point is to locate $c^*$, which is possible when the distribution $P$ is well-polarized. The definition of a well-polarized distribution takes two similar forms when looking at Proposition 4.4 or Proposition 4.5. In Proposition 4.4 the favorable case is when the poles are far from one another, separated by an empty area with respect to $P$, which ensures that the Hessian matrices $H(c^*)$ are positive definite (in this case they are diagonal matrices). When slightly disturbing the framework of Proposition 4.4, it is quite natural to think that the Hessian matrices $H(c^*)$ should remain positive definite. Proposition 4.5 is an illustration of this idea: the empty separation area between poles is replaced with an area where the density $f$ is small compared to its value around the poles. The condition on $\sigma$ and $\tilde{B}$ we offer in Proposition 4.5 gives a theoretical definition of a well-polarized distribution for quasi-Gaussian mixtures.

It is important to note that our result holds when $k$ is known and matches exactly the number of components of the mixture. When the number of clusters $k$ is larger than the number of components $\tilde{k}$ of the mixture, we have no general idea of where the optimal clusters can be placed. Moreover, suppose that we are able to locate the optimal vector of clusters $c^*$. As explained in the proof of Proposition 4.5, the quantity involved in Proposition 4.5 is in fact $B = \inf_{i \neq j} \|c_i^* - c_j^*\|$. Thus, in this case, we expect $B$ to be much smaller than $\tilde{B}$. Consequently, a condition like the one in Proposition 4.5 could not involve the natural parameter of the mixture $\tilde{B}$.

The two assumptions $N_i \ge 1 - \varepsilon$ and $B(m_i, \tilde{B}/3) \subset B(0, 1)$ can easily be satisfied when $P$ is constructed via a homothetic transformation. To see this, take a generic Gaussian mixture on $\mathbb{R}^2$, denote by $\tilde{m}_i$, $i = 1, \ldots, k$, its means and by $\tilde{\sigma}^2$ its variance. For a given $\varepsilon > 0$, choose $M > 0$ such that, for all $i = 1, \ldots, k$, $\int_{B(0,M)} e^{-\|x - \tilde{m}_i\|^2/(2\tilde{\sigma}^2)}\,dx \ge 2\pi\tilde{\sigma}^2(1 - \varepsilon)$ and $B(\tilde{m}_i, \tilde{B}/3) \subset B(0, M)$. Denote by $P_0$ the "quasi-Gaussian mixture" we obtain on $B(0, M)$ for such an $M$. Then, applying a homothetic transformation with coefficient $1/M$ to $P_0$ provides a quasi-Gaussian mixture on $B(0, 1)$, with means $m_i = \tilde{m}_i/M$, $i = 1, \ldots, k$, and variance $\sigma^2 = \tilde{\sigma}^2/M^2$. This distribution satisfies both $N_i \ge 1 - \varepsilon$ and $B(m_i, \tilde{B}/3) \subset B(0, 1)$.
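To make the quasi-Gaussian mixture model concrete, here is a minimal sampling sketch in dimension 2. It assumes that each component is an isotropic Gaussian with common standard deviation `sigma`, conditioned on falling in $B(0, 1)$, and that the components are mixed with weights `weights`; the function name, the means, and the parameter values are hypothetical. Rejected draws are redrawn from the same component, which corresponds to normalizing each truncated Gaussian separately.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_quasi_gaussian_mixture(means, sigma, weights, n):
    """Sample n points; component i is N(m_i, sigma^2 I_2) conditioned on B(0, 1)."""
    means = np.asarray(means, dtype=float)
    labels = rng.choice(len(means), size=n, p=weights)
    out = np.empty((n, 2))
    pending = np.arange(n)  # indices that still need an accepted draw
    while pending.size > 0:
        noise = sigma * rng.normal(size=(pending.size, 2))
        draws = means[labels[pending]] + noise
        inside = np.linalg.norm(draws, axis=1) <= 1.0
        out[pending[inside]] = draws[inside]
        pending = pending[~inside]  # redraw rejected points, same component
    return out, labels

# Hypothetical well-polarized mixture: sigma is small compared to the
# distance between the two means.
points, labels = sample_quasi_gaussian_mixture(
    means=[[-0.4, 0.0], [0.4, 0.0]], sigma=0.1, weights=[0.5, 0.5], n=1000
)
```

With these values almost no draw is rejected, which reflects the requirement that the normalization constants stay close to 1.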
5. Proofs

5.1. Proof of Theorem 3.1

The proof strongly relies on the localization principle and its application by Blanchard, Bousquet and Massart [4]. We start with the following definition.

Definition 5.1. Let $\Phi$ be a real-valued function. $\Phi$ is called a sub-$\alpha$ function if and only if $\Phi$ is non-decreasing and if the map $x \mapsto \Phi(x)/x^{\alpha}$ is non-increasing.

The next theorem is an adaptation of the result of Blanchard, Bousquet and Massart [4, Theorem 6.1]. For the sake of clarity its proof is given in Subsection 5.2.

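Theorem 5.1. Let $\mathcal{F}$ be a class of bounded measurable functions such that

(i) $\forall f \in \mathcal{F}$, $\quad \|f\|_\infty \le b$,

(ii) $\forall f \in \mathcal{F}$, $\quad \mathrm{Var}(f) \le \omega(f)$.

Let $K$ be a positive constant, $\Phi$ a sub-$\alpha$ function, $\alpha \in [1/2, 1)$. Then there exists a constant $C(\alpha)$ such that, if $D$ is a constant satisfying $D \le 6KC(\alpha)$ and $r^*$ is the unique solution of the equation $\Phi(r) = r/D$, the following holds. Assume that

\[ \forall r \ge r^*, \quad \mathbb{E}\left(\sup_{\omega(f) \le r} (P - P_n) f\right) \le \Phi(r). \]

Then, for all $x > 0$, with probability larger than $1 - e^{-x}$,

\[ \forall f \in \mathcal{F}, \quad Pf - P_n f \le K^{-1}\left(\omega(f) + \left(\frac{6KC(\alpha)}{D}\right)^{\frac{1}{1-\alpha}} r^* + \frac{(9K^2 + 16Kb)x}{4n}\right). \]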
This theorem emphasizes the fact that if we are able to control the variance and the complexity term controlled by the variance, we can get a possibly interesting oracle inequality. Obviously the main point is to find a suitable control function for the variance of the process. Here the interesting set is

\[ \mathcal{F} = \left\{\gamma(c, \cdot) - \gamma(c^*, \cdot),\ c \in B(0, 1)^k,\ c^* \in \mathcal{M}\right\}. \]

According to Section 3, the relevant control function for the variance of the process $\gamma(c, \cdot) - \gamma(c^*, \cdot)$ is proportional to $\|c - c^*\|^2$. Thus it remains to bound from above the quantity

\[ \mathbb{E}\left(\sup_{c^* \in \mathcal{M},\, \|c - c^*\|^2 \le \delta} (P_n - P)(\gamma(c^*, \cdot) - \gamma(c, \cdot))\right). \]

This is done in the following proposition.

Proposition 5.1. Suppose that $P$ satisfies Assumption 1. Then

\[ \mathbb{E}\left(\sup_{c^* \in \mathcal{M},\, \|c - c^*\|^2 \le \delta} (P_n - P)(\gamma(c^*, \cdot) - \gamma(c, \cdot))\right) \le \frac{C\sqrt{\delta}}{\sqrt{n}}, \]

where $C$ is a constant depending on $k$, $d$, and $P$.

Assuming that Assumption 1 and Assumption 2 are satisfied, we can apply Theorem 5.1 with $\omega(c, c^*) = A_2\|c - c^*\|^2$ and $b = 2$.
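The role of the fixed point $r^*$ can be illustrated numerically. Assuming a complexity bound of the form $\Phi(\delta) = C\sqrt{\delta/n}$, which is a sub-$\alpha$ function for $\alpha = 1/2$, the equation $\Phi(r) = r/D$ has the closed-form solution $r^* = C^2D^2/n$, which is where the $1/n$ rate comes from. The sketch below solves the equation by bisection and compares with the closed form; the values of `C` and `D` are arbitrary and the function names are hypothetical.

```python
import math

def phi(delta, C, n):
    """Complexity bound of the assumed form C * sqrt(delta / n)."""
    return C * math.sqrt(delta / n)

def fixed_point(C, D, n, lo=1e-12, hi=1e6):
    """Solve phi(r) = r / D by bisection on g(r) = phi(r) - r / D."""
    def g(r):
        return phi(r, C, n) - r / D
    # g is positive below the fixed point and negative above it.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

C, D = 3.0, 2.0  # arbitrary illustrative constants
r_star = {n: fixed_point(C, D, n) for n in (10, 100, 1000)}
closed_form = {n: (C * D) ** 2 / n for n in (10, 100, 1000)}
```

Multiplying $n$ by 10 divides the computed fixed point by 10, in agreement with the closed form.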

Lemma 5.1. Let $D > 0$. For all $c^* \in \mathcal{M}$, $x > 0$ and $K > D/7$, if $r^*$ is the (unique) solution of $\Phi(\delta) = \delta/D$, then we have, with probability larger than $1 - e^{-x}$,



(P - P„)(7(c, •) - 7(cV)) < X-Mallc - c 



-1a 11- „*i.2 , , , K + 18 



D 2 



-x. 



We are now in a position to prove Theorem 3.1. Take $c^* = c^*(c)$, a nearest optimal vector of clusters to $c$, and use (H1) to connect $\|c - c^*(c)\|^2$ to $\ell(c, c^*(c))$. Introducing the explicit form $r^* = \frac{C^2D^2}{n}$, we get, with $K = 2A_1A_2$, $D = 6K$, and probability larger than $1 - e^{-x}$,



1/2(P - P„)( 7 (c„, .) - 7 (c*(c), .)) < + -^—x. 

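Observing that $P_n(\gamma(\hat{c}_n, \cdot) - \gamma(c^*(\hat{c}_n), \cdot)) \le 0$, and taking expectation leads to, for all $c^* \in \mathcal{M}$,

\[ \mathbb{E}\ell(\hat{c}_n, c^*) \le \frac{C_0}{n}, \]

for some constant $C_0 > 0$ depending only on $k$, $d$, and $P$.
5.2. Proof of Theorem 5.1

This proof is a modification of the proof of Blanchard, Bousquet and Massart [4, Theorem 6.1]. For $r > 0$, set

\[ V_r = \sup_{f \in \mathcal{F}} (P - P_n)\frac{f}{\omega(f) + r}. \]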

We start with a modified version of the so-called peeling lemma:

Lemma 5.2. Under the assumptions of Theorem 5.1, there exists a constant $C(\alpha)$ depending only on $\alpha$ such that, for all $r > 0$,

\[ \mathbb{E}(V_r) \le C(\alpha)\frac{\Phi(r)}{r}. \]

Furthermore, we have $C(\alpha) \to \infty$ as $\alpha \to 1$.

Proof of Lemma 5.2. Let x > 1 be a real number. Because € Conv(J r ), 

sup (P-P„) — J— < sup (P-P n )— L— +Y, sup (P-Pn) r/r ■ 

Taking expectation on both sides leads to 

r r 1 + x K ) 

fc>0 v y 

Recalling that $ is a sub-a function, we may write $(rx fc+1 ) < x a ( k+1 '$(r). Hence we get 



r 

fc>0 



2 ' a; 1 -" 

Taking C(a) = inf (l + x Q U + j-i-i,! ^ proves the result. 



□ 



We are now in a position to prove Theorem 5.1. Using the inequality of Talagrand for a supremum of bounded variables that Bousquet [5] offered, we have, with probability larger than $1 - e^{-x}$,

\[ V_r \le \mathbb{E}(V_r) + \sqrt{\frac{x}{2rn}} + 2\sqrt{\frac{xb\,\mathbb{E}(V_r)}{nr}} + \frac{bx}{3rn}. \]

Using Lemma 5.2 and the inequality $(a + b)^2 \le 2(a^2 + b^2)$, we get

\[ V_r \le \frac{2C(\alpha)\Phi(r)}{r} + \sqrt{\frac{x}{2rn}} + \frac{4bx}{3rn}. \]

Let $r^*$ be the solution of $\Phi(r) = r/D$. If $r \ge r^*$, then $\frac{\Phi(r)}{r} \le \left(\frac{r^*}{r}\right)^{1-\alpha}\frac{1}{D}$. For such an $r$ we have

\[ V_r \le A_1 r^{-(1-\alpha)} + A_2 r^{-1/2} + A_3 r^{-1}, \]

with

\[ A_1 = \frac{2C(\alpha)(r^*)^{1-\alpha}}{D}, \qquad A_2 = \sqrt{\frac{x}{2n}}, \qquad A_3 = \frac{4bx}{3n}. \]



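We want to find a suitable $r$ such that $r \ge r^*$ and $V_r \le 1/K$. To this aim, it suffices to see that if $r \ge (3KA_1)^{\frac{1}{1-\alpha}} + (3KA_2)^2 + 3KA_3$ and $r \ge r^*$, then $V_r \le 1/K$, using the previous upper bound on $V_r$.

It remains to check that the condition $(3KA_1)^{\frac{1}{1-\alpha}} + (3KA_2)^2 + 3KA_3 \ge r^*$ holds. To see this just recall that

\[ (3KA_1)^{\frac{1}{1-\alpha}} = r^*\left(\frac{6KC(\alpha)}{D}\right)^{\frac{1}{1-\alpha}}. \]

Thus, we deduce that, if $D \le 6KC(\alpha)$, the choice $r = (3KA_1)^{\frac{1}{1-\alpha}} + (3KA_2)^2 + 3KA_3$ guarantees $V_r \le K^{-1}$ and, consequently,

\[ \forall f \in \mathcal{F}, \quad Pf - P_n f \le K^{-1}\left(\omega(f) + \left(\frac{6KC(\alpha)}{D}\right)^{\frac{1}{1-\alpha}} r^* + \frac{(9K^2 + 16Kb)x}{4n}\right). \]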



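5.3. Proof of Proposition 5.1

Using the differentiability of $P\gamma(c, \cdot)$, we get, for any $c \in (\mathbb{R}^d)^k$ and $c^* \in \mathcal{M}$,

\[ \gamma(c, x) = \gamma(c^*, x) + \langle c - c^*, \Delta(c^*, x)\rangle + \|c - c^*\| R(c^*, c - c^*, x), \]

where, with use of Pollard's [16] notation,

\[ \Delta(c^*, x) = -2\left((x - c_1^*)\mathbf{1}_{V_1^*}(x), \ldots, (x - c_k^*)\mathbf{1}_{V_k^*}(x)\right), \]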

R(c*,c-c*,x) = J-vylv^ llc-c*!!" 1 [2(c J - Cj )*a;+||c*|| 2 -2( C *)*c J + 

Observe that, because $\mathcal{M}$ is a finite set, by the dominated convergence theorem,

$$R(c^*, c - c^*, \cdot) \xrightarrow{\ \text{a.s.}\ } 0 \quad \text{when } (c - c^*) \to 0.$$

Splitting the expectation in two parts, we obtain

$$E\left(\sup_{c^* \in \mathcal{M},\, \|c - c^*\|^2 \le \delta} (P_n - P)\big(\gamma(c^*, \cdot) - \gamma(c, \cdot)\big)\right) \le E\left(\sup_{c^* \in \mathcal{M},\, \|c - c^*\|^2 \le \delta} (P_n - P)\big(-\langle c - c^*, \Delta(c^*, \cdot)\rangle\big)\right) + \sqrt{\delta}\, E\left(\sup_{c^* \in \mathcal{M},\, \|c - c^*\|^2 \le \delta} (P_n - P)\big(-R(c^*, c - c^*, \cdot)\big)\right) := A + B. \qquad (1)$$

5.3.1. Term A: complexity of the model

Term A in inequality (1) is at first sight the dominant term in the expression $\Phi(\delta)$. The upper bound we
obtain below is rather accurate, due to the finite-dimensional Euclidean space structure. Indeed, we have to
bound a scalar product when the vectors are contained in a ball, thus it is easy to see that the largest value
of the product matches in fact the largest value of the coordinates of the gradient term. We recall that $\mathcal{M}$
denotes the finite set of optimal vectors of clusters. Let $x = (x_1, \ldots, x_k)$ be a vector in $(\mathbb{R}^d)^k$. We denote by
$x_{i,r}$ the $r$-th coordinate of $x_i$, and name it the $(i, r)$-th coordinate of $x$. We may write



sup (c-c*,(P„-P)(-A(c*))) <2 sup sup 

c'eM,\\c-c*\\<VS c*£M j=l,...,k,r=l 



1 ™ 

i=i 

< (c J>)B , c . -c*,(P n -P)(-A(c*,.))}, 



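where $c^*$ attains the maximum and thus has the largest possible coordinate absolute value for $(P_n - P)(-\Delta(c^*, \cdot))$. Moreover, we denote by $(\hat{j}, \hat{r})$ the coordinate of this largest possible absolute value, by $\hat{\varepsilon}$ the sign
of its $(\hat{j}, \hat{r})$-th coordinate, and $c_{j,r,\varepsilon,c^*} = c^* + e_{j,r,\varepsilon}$, where $e_{j,r,\varepsilon}$ is the vector with $\varepsilon\sqrt{\delta}$ for its $(j, r)$ coordinate and $0$
elsewhere. Therefore we can reduce the set of the $c$'s of interest to a finite set, writing

$$\sup_{c^* \in \mathcal{M},\, \|c - c^*\| \le \sqrt{\delta}} \langle c - c^*, (P_n - P)(-\Delta(c^*, \cdot))\rangle \le \sup_{c^* \in \mathcal{M}}\ \sup_{j = 1, \ldots, k,\ r = 1, \ldots, d,\ \varepsilon = \pm 1} \langle c_{j,r,\varepsilon,c^*} - c^*, (P_n - P)(-\Delta(c^*, \cdot))\rangle.$$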

Taking into account that for every $c^*$ in $\mathcal{M}$, $P\Delta(c^*, \cdot) = 0$, and that for every fixed $c$ and $c^*$, the quantity
$\langle c - c^*, P_n(-\Delta(c^*, \cdot))\rangle$ is a sub-Gaussian random variable with variance $16\delta/n$, we get, by a maximal
inequality (Massart, [TH Part 6.1]): 



E( sup (P n -P)(-(c-c*),A(c*,.))) < (V21og(2 x Vs. 

\c*eM,||c-c*|| 2 <(5 J \ n J 

Therefore, the expected dominant term depends on the complexity of the model through its square root.
In our case, this complexity is the dimension of the space of vectors of clusters.
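The square-root-of-log dependence can be illustrated with the standard finite maximal inequality for sub-Gaussian variables, $E \max_{i \le N} Z_i \le \sigma\sqrt{2\log N}$. The following Monte Carlo sketch is only an illustration under assumptions: it uses independent centered Gaussian variables (the inequality itself does not need independence), and the names `mean_of_max` and `maximal_bound` are hypothetical. In the setting above, $N$ would play the role of the number of triples $(j, r, \varepsilon)$ times the cardinality of the set of optimal vectors, and $\sigma^2$ the role of the variance $16\delta/n$.

```python
import math
import random


def mean_of_max(num_vars: int, sigma: float, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max of num_vars independent N(0, sigma^2) draws]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) for _ in range(num_vars))
    return total / trials


def maximal_bound(num_vars: int, sigma: float) -> float:
    """Finite maximal inequality bound: sigma * sqrt(2 * log(num_vars))."""
    return sigma * math.sqrt(2.0 * math.log(num_vars))


if __name__ == "__main__":
    for n_vars in (10, 50, 200):
        print(n_vars, mean_of_max(n_vars, 1.0, 2000), maximal_bound(n_vars, 1.0))
```

The estimate stays below the bound, and both grow only like the square root of the logarithm of the number of variables.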



5.3.2. Bound on B 

To bound the second term in inequality (1), we follow the approach of Pollard [16], using complexity arguments such as Dudley's entropy integral.

Let $\mathcal{F}$ be a set of functions defined on $\mathcal{X}$ with envelope $F$. Let $S$ be a finite set and $f$ a function. We
denote $\|f\|_{L_2(S)} = \left(\frac{1}{n}\sum_{x \in S} f^2(x)\right)^{1/2}$, where $n = \mathrm{Card}(S)$, and by $N_F(\varepsilon, S, \mathcal{F})$ the smallest integer $m$ such
that there exist $\phi_1, \ldots, \phi_m$, $m$ functions on $\mathcal{X}$, satisfying $\min_{i = 1, \ldots, m} \|f - \phi_i\|^2_{L_2(S)} \le \varepsilon^2 \|F\|^2_{L_2(S)}$ for every $f \in \mathcal{F}$. Also define
$H(\varepsilon) = \sup_{\mathrm{Card}(S) < \infty} \log N_F(\varepsilon, S, \mathcal{F})$.

According to [16] and [15, Theorem 7], for the class of functions

$$\mathcal{F} = \left\{R(\cdot, c^*, c - c^*),\ c^* \in \mathcal{M},\ \|c - c^*\| \le \sqrt{\delta}\right\},$$

there exists $C > 0$ depending on $k$ and $d$ such that $F(x) = C(1 + \|x\|)$ is an envelope for $\mathcal{F}$. Furthermore,
for this envelope, we have

$$H(\varepsilon) \le \log(A) - W\log(\varepsilon),$$

where $A$ is a positive constant, and $W$ depends only on the pseudo-dimension of $\mathcal{F}$, in a way which will
not be described here (see the result of Pollard [15, Theorem 7]). We will use a classical chaining argument
to bound term B. Let $\bar{c}$ denote the pair $(c, c^*) \in (B(0, 1))^k \times \mathcal{M}$. For convenience, let $f_{\bar{c}}$ denote the function
$R(\cdot, c^*, c - c^*)$. We set $\varepsilon_0 = 1$ and $\varepsilon_j = 2^{-j}\varepsilon_0$.

For any $f_{\bar{c}}$, let $f_{\bar{c}_j}$ be a function such that $\|f_{\bar{c}} - f_{\bar{c}_j}\|_{L_2(S)} \le \varepsilon_j\|F\|_{L_2(S)}$ for every finite set $S$. Since
Assumption 1 holds, $F$ is bounded from above by a constant $C_F$. By the dominated convergence theorem we
have $f_{\bar{c}_j} \to f_{\bar{c}}$ as $j \to \infty$, and thus

$$(P_n - P)f_{\bar{c}} = (P_n - P)f_{\bar{c}_0} + \sum_{j=1}^{\infty}(P_n - P)\left(f_{\bar{c}_j} - f_{\bar{c}_{j-1}}\right).$$

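Therefore

$$E\left(\sup_{c^* \in \mathcal{M},\, \|c - c^*\| \le \sqrt{\delta}} (P_n - P)f_{\bar{c}}\right) \le E\left(\sup_{c^* \in \mathcal{M},\, \|c - c^*\| \le \sqrt{\delta}} (P_n - P)f_{\bar{c}_0}\right) + \sum_{j > 0} E\left(\sup_{c^* \in \mathcal{M},\, \|c - c^*\| \le \sqrt{\delta}} (P_n - P)\left(f_{\bar{c}_j} - f_{\bar{c}_{j-1}}\right)\right).$$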

Using a symmetrization inequality and introducing some Rademacher random variables $\sigma$ ($\sigma = \pm 1$ with
probability 1/2), we get, for the first term:



E sup 

Vc*eA't,||c-c*||<\/5 



(Pn ~ P)h ) < 2E x E a ( sup - a, fco (xA 

< 2^2E X [ 4 /sup||/ £o || 2 L2(Pn) log(m(eo)) ) 



< 2V2E X ^\\F\\l 2{Pn) log(m(s ))) 

< 2V2E X ^C F \og{m{e )) 

< HA. 

where $k_A$ depends on $k$, $d$ and $P$. In the second line of this inequality, we used the maximal inequality for
random processes depending only on Rademacher variables given by Massart [HI part 6.1]. It remains to 
bound the second term. Using the same approach (symmetrization and maximal inequality for Rademacher 
variables) we get, for every j > 0, 

E (sup (P n - P)(f- Cj - / ej _j) < 2E X ^ logM^M^)) sup \\f~ C] - h 3 _ t ||| 2(p ^ . 



However - h\\ L *(p n ) < £j\\F\\L*(P n ), consequently 

< ^fsU 



■j-illL 2 (P„) — ^^i-lll- 1 lli 2 (P„) 
2 _2 



Comparing a sum with an integral, we obtain 



E E ( SU P ( P « - P )(4 - ] ^ ^= f 1 Vlog(m( £ ))de, 

j>0 \ c * e - M 'll c - c *ll<v / J / V n JO 

which, by assumption on $m(\varepsilon)$, can be bounded from above by $k_B/\sqrt{n}$, where $k_B$ depends on $k$, $d$ and $P$.
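To see why the entropy integral is finite, one can evaluate it numerically. The sketch below is an illustration under an explicit assumption: it takes $\log m(\varepsilon) = \log(A) - W\log(\varepsilon)$, the form of the entropy bound stated earlier, with arbitrary illustrative values of $A \ge 1$ and $W \ge 0$. The names `entropy_integral` and `jensen_bound` are hypothetical. The substitution $\varepsilon = e^{-t}$ turns the integral over $(0, 1)$ into $\int_0^\infty \sqrt{\log A + W t}\, e^{-t}\,dt$, and Jensen's inequality gives the simple upper bound $\sqrt{\log A + W}$.

```python
import math


def entropy_integral(log_a: float, w: float, n_steps: int = 60_000, t_max: float = 60.0) -> float:
    """Midpoint-rule value of int_0^1 sqrt(log_a + w * log(1/eps)) d(eps).

    Uses eps = exp(-t), i.e. int_0^t_max sqrt(log_a + w * t) * exp(-t) dt;
    the tail beyond t_max is negligible.
    """
    if log_a < 0.0 or w < 0.0:
        raise ValueError("log_a and w must be non-negative")
    step = t_max / n_steps
    total = 0.0
    for i in range(n_steps):
        t = (i + 0.5) * step
        total += math.sqrt(log_a + w * t) * math.exp(-t)
    return total * step


def jensen_bound(log_a: float, w: float) -> float:
    """Upper bound sqrt(log_a + w), since int_0^1 log(1/eps) d(eps) = 1."""
    return math.sqrt(log_a + w)


if __name__ == "__main__":
    print(entropy_integral(math.log(3.0), 2.0), jensen_bound(math.log(3.0), 2.0))
```

When $A = 1$ the integral has the closed form $\sqrt{W\pi}/2$, which gives a convenient sanity check of the quadrature.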
We are now in a position to prove Proposition 5.1. From the two above subsections we deduce that

$$\Phi(\delta) = E\left(\sup_{c^* \in \mathcal{M},\, \|c - c^*\|^2 \le \delta} (P_n - P)\big(\gamma(c^*, \cdot) - \gamma(c, \cdot)\big)\right)$$

<Vs^=. 



This concludes the proof. 



5.4. Proof of Theorem 3.3 



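Let $x = (x_1, \ldots, x_k)$ be a $k \times d$ vector, and $V_1, \ldots, V_k$ the Voronoi diagram associated with an optimal vector
of clusters $c^*$. We state here a sufficient condition for the Hessian matrix $H(c^*)$ to be positive definite. Denote
$r_{i,j} = \|c_i^* - c_j^*\|$. It holds

$$\langle H(c^*)x, x\rangle = \sum_{i=1}^{k}\left(\langle H_{i,i}x_i, x_i\rangle + \sum_{j \ne i}\langle H_{i,j}x_j, x_i\rangle\right),$$

where, for all $i = 1, \ldots, k$,

$$\langle H_{i,i}x_i, x_i\rangle + \sum_{j \ne i}\langle H_{i,j}x_j, x_i\rangle = 2P(V_i)\|x_i\|^2 - 2x_i^t\left(\sum_{j \ne i} r_{i,j}^{-1}\int_{\partial(V_i \cap V_j)} f(u)(u - c_i^*)(u - c_i^*)^t\,du\right)x_i + 2x_i^t\sum_{j \ne i} r_{i,j}^{-1}\left(\int_{\partial(V_i \cap V_j)} f(u)(u - c_i^*)(u - c_j^*)^t\,du\right)x_j.$$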

The support of $P$ is included in $B(0, 1)$, thus we can replace $\partial(V_i \cap V_j)$ with $\partial(V_i \cap V_j) \cap B(0, 1)$ in the
equations above. However, to lighten notation we will omit the indication and implicitly assume that every
set we consider is contained in $B(0, 1)$. Let $p_{i,j} = \int_{\partial(V_i \cap V_j)} f(u)\,du$ be the $(d-1)$-dimensional $P$-measure of the
boundary between $V_i$ and $V_j$. Recalling that the underlying norm is the Euclidean norm, even for matrices,
we may write



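$$\langle H_{i,i}x_i, x_i\rangle + \sum_{j \ne i}\langle H_{i,j}x_j, x_i\rangle \ge 2P(V_i)\|x_i\|^2 - 2\|x_i\|\left\|\sum_{j \ne i} r_{i,j}^{-1}\left(\int_{\partial(V_i \cap V_j)} f(u)(u - c_i^*)(u - c_i^*)^t\,du\right)x_i\right\| - 2\|x_i\|\left\|\sum_{j \ne i} r_{i,j}^{-1}\left(\int_{\partial(V_i \cap V_j)} f(u)(u - c_i^*)(u - c_j^*)^t\,du\right)x_j\right\|,$$

with

$$\left\|\sum_{j \ne i} r_{i,j}^{-1}\left(\int_{\partial(V_i \cap V_j)} f(u)(u - c_i^*)(u - c_j^*)^t\,du\right)x_j\right\| \le \sum_{j \ne i} r_{i,j}^{-1}\left\|\left(\int_{\partial(V_i \cap V_j)} f(u)(u - c_i^*)(u - c_j^*)^t\,du\right)x_j\right\| \le \sum_{j \ne i} r_{i,j}^{-1}\left(\int_{\partial(V_i \cap V_j)} f(u)\|u - c_i^*\|\|u - c_j^*\|\,du\right)\|x_j\| \le \sum_{j \ne i} r_{i,j}^{-1}\,4p_{i,j}\|x_j\|.$$

Next,

$$\langle H_{i,i}x_i, x_i\rangle + \sum_{j \ne i}\langle H_{i,j}x_j, x_i\rangle \ge \left(2P(V_i) - \frac{8}{B}\sum_{j \ne i} p_{i,j}\right)\|x_i\|^2 - \frac{8}{B}\sum_{j \ne i} p_{i,j}\|x_i\|\|x_j\|,$$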

where we recall that $\|x_i\|\|x_j\| \le \frac{1}{2}\left(\|x_i\|^2 + \|x_j\|^2\right)$ and $B = \inf_{i \ne j}\|c_i^* - c_j^*\|$. Summing with respect to $i$
leads to



2=1 \ j^l 



The last step is to derive bounds for $p_{i,j}$ from conditions on $f$. Denote $A = \|f\|_\infty$; we see that

$$\sum_{j \ne i} p_{i,j} = \int_{\partial V_i} f(u)\,du.$$

$V_i$ is a regular convex set included in $B(c_i^*, 2)$. Therefore, by a direct application of Stokes' theorem, the
surface of $\partial V_i$ is smaller than the surface of $S_{d-1}(c_i^*, 2)$ (the sphere of radius 2). Consequently

It follows that A < ^+i d/ l/2 inf P(Vi) is enough to ensure that the Hessian matrix H(c*) is positive 

"• i=i,..., 

definite. 



5.5. Proof of Proposition 4.4

We take a distribution uniformly distributed over small balls far from one another. Denote by $V_i$ the
Voronoi cell associated with $z_i$ in $(z_1, \ldots, z_k)$. Let $Q$ be a $k$-quantizer, and $Q^*$ the expected optimal quantizer
which maps $V_i$ to $z_i$ for all $i$. Denote finally, for all $i = 1, \ldots, k$, $R_i(Q) = \int_{V_i}\|x - Q(x)\|^2\,dx$ the contribution
of the $i$-th Voronoi cell to the risk of $Q$.
First we compute

$$R_i(Q^*) = \frac{1}{kV\rho^d}\int_0^{\rho} S r^{d+1}\,dr = \frac{d\rho^2}{k(d+2)},$$

where $S$ and $V$ are the unit surface and the volume of the unit ball in $\mathbb{R}^d$ (so that $S = dV$).
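The radial computation above can be verified numerically. The sketch below is an illustration under assumptions: it takes the density on each ball to be $1/(kV\rho^d)$ (uniform on $k$ balls of radius $\rho$ with equal weights), and the function names are hypothetical. It integrates $S r^{d+1}$ over $[0, \rho]$ with a midpoint rule and compares the result with the closed form $d\rho^2/(k(d+2))$.

```python
import math


def unit_ball_volume(d: int) -> float:
    """Volume V of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def cell_risk_numeric(d: int, rho: float, k: int, n_steps: int = 100_000) -> float:
    """Midpoint-rule value of (1 / (k V rho^d)) * int_0^rho S r^(d+1) dr."""
    v = unit_ball_volume(d)
    s = d * v  # surface area of the unit sphere, S = d * V
    h = rho / n_steps
    radial = sum(s * ((i + 0.5) * h) ** (d + 1) for i in range(n_steps)) * h
    return radial / (k * v * rho ** d)


def cell_risk_closed_form(d: int, rho: float, k: int) -> float:
    """Closed form d * rho^2 / (k * (d + 2))."""
    return d * rho ** 2 / (k * (d + 2))


if __name__ == "__main__":
    for dim in (1, 2, 3, 5):
        print(dim, cell_risk_numeric(dim, 0.3, 4), cell_risk_closed_form(dim, 0.3, 4))
```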

Let $i$ be an integer between 1 and $k$. Let $m_i^{in} = |Q(B_d(z_i, \rho)) \cap V_i|$ be the number of images of $V_i$ sent
by $Q$ inside $V_i$, and let $m_i^{out} = |Q(B_d(z_i, \rho)) \cap V_i^c|$ be the number of images of $V_i$ sent outside $V_i$. The three
situations of interest are the following ones:

-> If $m_i^{in} = 1$ and $m_i^{out} = 0$, it is clear that $R_i(Q) \ge R_i(Q^*)$.
-> If $m_i^{in} \ge 2$ and $m_i^{out} = 0$, then we just can see that $R_i(Q) \ge R_i(Q^*) - \frac{d\rho^2}{k(d+2)} = 0$.
-> At last, suppose that $m_i^{out} \ge 1$. Then there exists $x \in B_d(z_i, \rho)$ such that

IIQ(x) — xll < inf liar — ell 

c€0(Bd(ai,p)) 

||Q(x)-x|| >d( Zl ,vn- P >^-p 



Let c e Q(Bd(zi,p)). Then 

\\c-Zi\\ > Hc-arll-p 

> \\Q(x)-x\\-p 

>f-v 

Then, we deduce that, for every y e Bd(zi, p) and c e Q{Bd{zi 1 p)), \\y — c\\ > § — 3p. Therefore 

*«) > 



Now suppose that $m_i^{in} \ge 2$. Then at least two clusters of $Q$ lie in $V_i$. Therefore, there exists $j$ such that no
cluster of $Q$ lies in $V_j$, so that $m_j^{out} \ge 1$. We straightforwardly deduce that the number of cells $V_i$ for which
$m_i^{in} \ge 2$ is smaller than the number of cells $V_j$ for which $m_j^{out} \ge 1$.

Taking into account all contributions of Voronoi cells, we get 



{t;raj">2,m™'=0} {i;m™'>l} {i;mj" = l,m™'=0} 

{i;m*">2,m°"*=0} \ x 7 / 

from which we deduce a sufficient condition to get R(Q) > R(Q*). 
5.6. Proof of Proposition 4.3

Using the same method as in the proof of Proposition 4.4, we prove that, for $\eta = 2$ and $R = 10$, the optimal
vector of clusters is ^ + R, | + 2i?). Thus the density / is zero- valued on each boundary of every Voronoi 
cell of the optimal vector of centroids. Consequently Assumption 2 is satisfied. For n large enough, 

P'y(c n ,.)= — / x 2 dx H / x 2 dx + q / x 2 e~ x dx 

3?7 Jo 3?7 7 fl J v +2R-i 

„2, 

f - _ / >+0 ° 

+ q (x-n)e~ x dx + q I (x - n 2 ) 



2 e x dx. 



Hence, by the dominated convergence theorem for the first three terms of the right-hand side and through
computation for the remaining terms, $P\gamma(c_n, \cdot) \to P\|x\|^2$ as $n \to \infty$.

5.7. Proof of Proposition 4.5

We begin with a lemma which ensures that every possible optimal centroid $c_i^*$ is close to at least one mean
$m_j$ of the mixture when the ratio $p_{min}/p_{max}$ is large enough.

Lemma 5.3. Let $c^*$ be an optimal vector of clusters. Suppose that

$$\frac{p_{min}}{p_{max}} \ge \frac{288 k\sigma^2}{(1 - \varepsilon)\tilde{B}^2\left(1 - e^{-\tilde{B}^2/288\sigma^2}\right)}.$$

Then, for every $j = 1, \ldots, k$ there exists $i \in \{1, \ldots, k\}$ such that $\|m_j - c_i^*\| \le \tilde{B}/6$.

Proof of Lemma 5.3. Denote by $m$ the vector of clusters $(m_1, \ldots, m_k)$, and by $M_i$ the Voronoi cell associated
with $m_i$. We bound from above the quantity $P\gamma(m, \cdot)$:

P7(m,0 = g^/ jfi ||«-„ ( ||V"-~"^ (i , 



< 



= 1 

2kp raax o 



> 144 



B p 3 / ii- 



> 2887ra 2 7V, ./ B(m .,B /12) 6 



»' ir/2<7 ^ 



Let $c$ be a vector of clusters such that there exists $j$ satisfying, for all $i = 1, \ldots, k$, $\|m_j - c_i\| > \tilde{B}/6$. We
will prove that $P\gamma(c, \cdot) > P\gamma(m, \cdot)$, which implies that $c \notin \mathcal{M}$. In fact we have, for all $i = 1, \ldots, k$ and for
all $x \in B(m_j, \tilde{B}/12)$, $\|x - c_i\| \ge \tilde{B}/12$. Hence, a lower bound for $P\gamma(c, \cdot)$ is

Pj( c i •) > / rnin II s — Ci\\ 2 f(x)dx 

V-4- / e -n™ ll 2 /2a 2 dx 

^2w 2 7V 4 7 e(m3!i j/i2) 

PminB 2 t _ -B 2 /288(T 2N \ 

> 144 V 1 ) 

> P7(m,.)- 

Hence we deduce that every optimal vector of clusters has a centroid within distance $\tilde{B}/6$ of every mean $m_j$
of the mixture. □

Suppose that the ratio $p_{min}/p_{max}$ satisfies the assumption of Proposition 4.5. In particular $p_{min}/p_{max}$
satisfies the assumption of Lemma 5.3. Then we deduce that, up to a reindexation, for every $c^* \in \mathcal{M}$,
$\|c_i^* - m_i\| \le \tilde{B}/6$. We conclude that $2\tilde{B}/3 \le B \le 4\tilde{B}/3$.
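The two-sided comparison between the minimal distance of the means and the minimal distance of the optimal centroids is a direct consequence of the triangle inequality. The sketch below illustrates it on a hypothetical planar configuration: the point coordinates, the helper names `min_pairwise_distance` and `perturb`, and the random perturbation scheme are all assumptions made for the example. Each point is moved by at most one sixth of the minimal pairwise distance, so every pairwise distance changes by at most one third of it.

```python
import itertools
import math
import random


def min_pairwise_distance(points):
    """Smallest Euclidean distance between two distinct points."""
    return min(math.dist(p, q) for p, q in itertools.combinations(points, 2))


def perturb(points, radius, rng):
    """Move each planar point by a vector of norm at most `radius`."""
    moved = []
    for px, py in points:
        angle = rng.uniform(0.0, 2.0 * math.pi)
        length = rng.uniform(0.0, radius)
        moved.append((px + length * math.cos(angle), py + length * math.sin(angle)))
    return moved


if __name__ == "__main__":
    means = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (5.0, 5.0)]  # illustrative means
    b_tilde = min_pairwise_distance(means)
    rng = random.Random(1)
    centers = perturb(means, b_tilde / 6.0, rng)
    b = min_pairwise_distance(centers)
    print(2.0 * b_tilde / 3.0, b, 4.0 * b_tilde / 3.0)
```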

Since, for all $i = 1, \ldots, k$, $B(c_i^*, B/2) \subset V_i^*$, it is easy to see that $B(m_i, B/4) \subset B(c_i^*, B/2) \subset V_i^*$, which
leads to $N^* \subset \left(\bigcup_{i=1}^{k} B(m_i, B/4)\right)^c$. Consequently, in order to apply Theorem 3.2, we just have to prove that



11/ 



Jt./ {B{mhBIA)) - 



U B(m,,B/4) 

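First we derive a lower bound for the right-hand side. For every $i = 1, \ldots, k$,

$$P(B(m_i, B/4)) \ge \frac{p_i}{2\pi\sigma^2 N_i}\int_{B(m_i, B/4)} e^{-\frac{\|x - m_i\|^2}{2\sigma^2}}\,dx \ge \frac{p_i}{2\pi\sigma^2 N_i} \times 2\pi\int_0^{B/4} r e^{-\frac{r^2}{2\sigma^2}}\,dr \ge p_{min}\left(1 - e^{-\frac{B^2}{32\sigma^2}}\right).$$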
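The radial integral in this lower bound has a closed form, since $\int_0^{B/4} r e^{-r^2/(2\sigma^2)}\,dr = \sigma^2\left(1 - e^{-B^2/(32\sigma^2)}\right)$. The sketch below checks this numerically in the planar case; it is only an illustration, with hypothetical function names and arbitrary test values of $B$ and $\sigma$, and it ignores the normalization constant $N_i$ and the weight $p_i$.

```python
import math


def radial_mass(radius: float, sigma: float, n_steps: int = 100_000) -> float:
    """Midpoint-rule value of (1/(2 pi sigma^2)) * 2 pi * int_0^radius r exp(-r^2/(2 sigma^2)) dr."""
    h = radius / n_steps
    total = 0.0
    for i in range(n_steps):
        r = (i + 0.5) * h
        total += r * math.exp(-r * r / (2.0 * sigma ** 2))
    return total * h / sigma ** 2


def closed_form_mass(b: float, sigma: float) -> float:
    """Mass of a centered planar Gaussian in the ball of radius b / 4."""
    return 1.0 - math.exp(-b * b / (32.0 * sigma ** 2))


if __name__ == "__main__":
    b, sigma = 2.0, 0.5  # arbitrary illustrative values
    print(radial_mass(b / 4.0, sigma), closed_form_mass(b, sigma))
```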

Then, we deal with the left-hand side. Let $x$ be at distance from every $m_i$ of at least $B/4$. Then

$$f(x) \le \frac{k p_{max}}{2\pi\sigma^2(1 - \varepsilon)}\,e^{-\frac{B^2}{32\sigma^2}}.$$

The rest of the proof follows from straightforward computation, using the assumption of Proposition 4.5 and
the relationship between $B$ and $\tilde{B}$: $2\tilde{B}/3 \le B \le 4\tilde{B}/3$.

Remark. A careful reader should have noticed that the $k$ factor is suboptimal in the previous inequality.
In fact we are able in this case to bound $f(x)$ from above with $\frac{1}{2\pi\sigma^2(1 - \varepsilon)}e^{-\frac{B^2}{32\sigma^2}}$. However, this bound does
not involve $p_{max}$, and so leads to a condition not on the ratio of extremal proportions of the mixture, but
rather on the minimal proportion of the mixture, which is less natural. Moreover, the $p_{max}$-free bound is
valid only in the equal variance case, i.e., when the variance of every element of the mixture is the same.
In general it is not the case and a condition as in Proposition 4.5 for that kind of mixture would naturally
involve the ratio $p_{min}/p_{max}$.